Continuous Prediction of Web User Visual Attention on Short Span Windows Based on Gaze Data Analytics

Understanding users’ visual attention on websites is paramount to enhance the browsing experience, such as providing emergent information or dynamically adapting Web interfaces. Existing approaches to accomplish these challenges are generally based on the computation of salience maps of static Web interfaces, while websites increasingly become more dynamic and interactive. This paper proposes a method and provides a proof-of-concept to predict user’s visual attention on specific regions of a website with dynamic components. This method predicts the regions of a user’s visual attention without requiring a constant recording of the current layout of the website, but rather by knowing the structure it presented in a past period. To address this challenge, the concept of visit intention is introduced in this paper, defined as the probability that a user, while browsing, will fixate their gaze on a specific region of the website in the next period. Our approach uses the gaze patterns of a population that browsed a specific website, captured via an eye-tracker device, to aid personalized prediction models built with individual visual kinetics features. We show experimentally that it is possible to conduct such a prediction through multilabel classification models using a small number of users, obtaining an average area under curve of 84.3%, and an average accuracy of 79%. Furthermore, the user’s visual kinetics features are consistently selected in every set of a cross-validation evaluation.


Introduction
Understanding the behavior of users' visual attention on websites is an active research area tackled by the fields of computer vision, human-computer interaction, and Web intelligence. By using information gleaned from this understanding, a website can be constructed to deliver information of greater utility and complexity, such as emergent recommendations in regions of a Web interface that capture users' attention, adapt visual stimuli to users' gaze patterns in real time, among other advantages. Traditional methods to study users' visual attention have historically focused on images. However, recently more complex visual stimuli have been considered, such as videos [1][2][3][4][5], virtual reality environments [6,7], egocentric videos [8,9], and websites. Unfortunately, the literature on users' visual attention on websites is generally based on salience maps computed for static Web interfaces [10][11][12][13], where the website structure is always known in advance [14]. Therefore, as dynamic and interactive websites become increasingly prevalent, traditional methods of evaluating visual attention, such as those based on static salience maps or pre-determined website structures, are becoming less effective.
It is imperative to gain a deeper understanding of users' visual attention on websites to create an optimal online experience. This requirement can be realized by dynamically predicting users' visual attention on specific regions of a website. This understanding offers several advantages, such as identifying regions of the website that capture users' attention and making them more prominent; adjusting the layout and design to match users' gaze patterns, or improving the user experience by creating a more intuitive and user-friendly interface. Furthermore, this understanding can aid in optimizing a website's layout to increase user engagement and conversion rates, and addressing usability issues, such as elements on the website that may be causing confusion or frustration.
Users' own browsing actions are the most significant source of difficulty in predicting their visual attention in current dynamic websites. For instance, these actions may yield the display of menus, pop-ups, or banners when clicking certain regions, or movement of the webpage when using the lateral scroll. Given that the interaction with a website generates constant changes in its layout, there exists the need for a prediction that identifies the regions of a user's visual attention without requiring the constant recording of the current layout but knowing the structure it presented in a past period. To address this challenge, this paper introduces the concept of visit intention, defined as the probability that a user, while browsing, will fixate on a specific region of the website in the next period. These regions are known as areas of interest (AOIs) and are defined by specific groupings of the website components.
In this paper, visit intentions to AOIs are obtained through multilabel classifiers that leverage different features extracted from eye-tracking data. The first set of features comprises a user's gaze fixations in different periods over the components of AOIs. Since generally little information per user is available, our approach generates a population-level classification model to improve generalization and reduce the risk of overfitting from training models with few samples. Furthermore, this means that a subset of these features embeds information about the behavior of other users in that of a specific participant. That is, visit intention prediction models integrate, in the first instance, the features calculated from a population-level gaze pattern, allowing a robust training of models of individual visit intentions, increasing their performance. Another set of features comprises the user's visual kinetics, such as gaze position, velocity, and acceleration measurements in the Xand Y-coordinates of the interface. We hypothesize that visual kinetics features enable the prediction models to capture tendencies of the gaze at the beginning of a period, which would inform future eye movement patterns. This paper thus attempts to answer the following research question: • RQ: To what extent is it possible to accurately classify a user's visit intention in a certain period in real time to specific AOIs of a website, leveraging population-level general gaze patterns and a user's particular gaze data?
To answer this research question, 51 users participated in an experiment browsing a website with dynamic components in front of an eye-tracking device. This paper shows that by using the population-level gaze patterns of a few users and applying diverse classification methods and feature selection techniques, a user's visit intention to AOIs in a period can be predicted as proposed in RQ (average AUC = 0.843, average ACC = 0.79). Furthermore, individual users' visual kinetics features are consistently selected in every cross-validation set, confirming the hypothesis. This paper is organized as follows. Section 2 presents the related work. Section 3 defines key concepts, delivers a formulation of the problem to be addressed, and explains the apparatus used for the experimentation carried out to answer the research question. Section 4 explains the proposed prediction methodology. Section 5 shows an experimental evaluation by applying the proposed methodology to a real dataset of Web users' gaze data. Finally, Section 6 presents the discussions of the work carried out, while the paper is concluded in Section 7.

Literature Review
Visual attention is a mechanism that filters relevant areas of a stimulus from the rest of redundant information. Knowledge of a user's attention can be useful in applications such as coding and data transmission, improvement in recommendation and information delivery systems, and performance evaluation in different visual stimuli, among others.
In the literature, attention is categorized into two functions: bottom-up attention and top-down attention. The first corresponds to the selection of zones of a visual stimulus based on its most salient inherent characteristics in relation to the background. This function of attention operates with input in a crude, involuntary way. The second integrates knowledge of the visual scene, goals or objectives of the user with respect to the stimulus. Moreover, it implements longer-term cognitive strategies on the part of the person [15][16][17].
Studies focused on top-down attention models are less prominent in the literature than their bottom-up counterparts, since their functionality is influenced by factors external to the stimulus. In addition, calculated responses vary from model to model, i.e., from task to task, making it difficult to generalize a stimulus model. On the other hand, different models of bottom-up visual attention have been proposed. In this work, bottom-up models are studied.

Attention Models in Static Stimulus
The focus of classic studies on visual attention corresponds to static visual stimuli, that is, stimuli that do not present an alteration during the time period of exposure to a user. In this field, there are multiple studies on visual attention when making "free viewing" on images. Of these, there are mainly three approaches for the study of visual attention: saliency map generation, scanpaths models and the saccade models.
Saliency maps are topographic representations of the visual prominence of a stimulus. That is, they make it easier to understand areas of greater or lesser importance in the selection of points of attention. There are multiple studies in which saliency maps are applied to images. In [18], transformations are made from features present in different natural images to maps of salience, searching for a smaller number of characteristics to use. In [19], the images are convolved with a template and postprocessed to deliver the result. Association methods have also been utilized. In [20], a method called saliency transfer is presented, where a support set of best matches from a large database of saliency annotated images is retrieved, and transitional saliency scores are assigned by warping the support set annotations onto the input image according to computed dense correspondences. In other research, deep learning approaches has been adopted. In [21], a deep convolutional neural network is proposed that is capable of predicting fixations and segmenting objects in different images. In [22], an approach based on convolutional neural networks is employed which incorporates multi-level saliency predictions within a single network, capturing hierarchical saliency information from deep, coarse layers with global saliency information to shallow, fine layers with local saliency response. Other deep learning approaches applied for salient object detection have been reviewed in the literature [23,24] or for co-salient object (group of objects that occur together in an image) [25]. In general, there is a large database of images and different models developed in the state-of-the art approaches [26], as well as studies about their performance metrics [27,28].
Scanpaths are attempts to predict sequences of users' fixations within a stimulus. While many papers study scanpath generation and saliency map extraction independently, research focusing on the fusion of these approaches is limited. However, some works focus on the identification of scanpaths by using sampling methods on top of saliency maps [10,11]. Other studies have utilized fixation data for the calculation of saliency maps, such as the Attentive Saliency Network (ASNet), developed in [29].
Saccadic models represent another option. In [30], predictions of exploration routes are made in natural images. In [31], the saccadic model considers age ranges of users. This model integrates the interaction between the way in which users observe the visual information and the mental state of each user.
It has been demonstrated that static models present good results; however, in this work, we seek to study the visual attention of users in dynamic environments, which in general are more complex since they consider time as a new variable.

Attention Models in Dynamic Stimulus
There are models developed for bottom-up attention on dynamic stimuli, that is, environments that present movement or that change their structure when interacting in a certain way with the person. For example, a video presents alterations (frame-to-frame movement), but it does not depend on the interaction. On the other hand, a web page presents dynamism in its structure or content depending on the user's interaction.
In [1], predictions of saliency maps are made in videos, frame-by-frame. For this purpose, recurrent network methods with mixed density are used. In this study, the visual stimulus is known over time, thus allowing it to be broken down into a series of static images. However, the advantage of this approach, in which the saliency map for each frame of the video is generated by the information from the previous frames, is that information is integrated from the user interaction. Other studies have developed the evaluation of attention in videos, such as [32] which compares a model without video annotations, with annotations of foreground objects, and other biologically based annotations, or [33] where a deep learning approach was adopted.
Saliency maps have also been applied to stimuli in web environments. In [34], a saliency map is calculated at the task level. However, most studies have tended to focus on static calculations [10][11][12][13]. That is, they do not examine the temporal component of the user's interaction with the website. Finally, there is a dynamic visual attention study [14] in which the site is assumed to be known at all times, and at the same time, it is assumed to know the duration of each subtask conducted by the user.
Although these methods correspond to a dynamic stimulus and contemplate information of user interaction, they generally focus on knowing the stimulus at all times, either by its nature independent of the user or by predicting the duration of subtasks (predicting the structure of the stimulus in certain periods of time). Our approach seeks to generate predictions directly from visual data, without the need for predicting changes in the stimulus or its structure control at all times, seeking to accelerate the processing times of the model in execution.

Attention Models in Egocentric Vision
Egocentric vision is a recently introduced area of study. It details the analysis of attention in videos captured with wearable cameras on a user such that the visual fields during the execution of different tasks can be determined. In general, cameras are mounted on the head or chest of the user, and in addition to recording the environment in which it is carried out, the eye activity is recorded by an eye-tracker.
A recent study on gaze prediction in these environments is [8]. Here, the authors look for the generation of saliency maps in future frames through frame prediction, a similar approach to the one used in [9]. Although these environments are dynamic and depend on user interaction, the possibilities for interaction are so broad that traditional bottom-up models are insufficient for visual prediction [35]. Other models developed do not achieve generalization because they have specific cases for each task [36,37].
As in the case of the dynamic environments described above, the models consider the prediction of a visual stimulus structure or future sub-tasks for the estimation of visual attention. This is because the stimuli in this area are too complex, escaping our framework.
In particular, what we are looking for is to study a dynamic stimulus that moves within well-defined structures.
To the best of our knowledge, this is the first work to study attention transition between zones of a dynamic visual stimulus that changes according to the user's behavior, directly from user's visual information, without predicting changes in the stimulus structure.

Problem Definition
This paper studies the visual attention of a user on a dynamic stimulus, within a bottom-up attention environment, that is, where attention is guided by inherent properties of the stimulus, and not by factors internal to the user such as previous knowledge of the stimulus or specific tasks to be carried out (top-down attention). To study the above topic, a two-dimensional stimulus is created and is exposed to different users, and each interaction is recorded with eye-tracker instrumentation.
User visual behavior is represented by the recording of gaze data. In this way, each point of gaze x i is represented by a vector of three components x i = [x i , y i , T i ], where the first two correspond to the horizontal and vertical coordinates of each record over a two-dimensional stimulus, and the third corresponds to the time in interaction. In the following, a signal is referred to as the visual record of a complete interaction.
The eye-tracker records a user's visual signals. This gives us information of gaze, being able to identify fixations and saccades. A fixation is the state in which the eyes remain static, with some micro-movements, for a period of time (100-300 ms), for example, over a word when reading. They are identified as sets of gaze data, and their position is considered as the place where the user keeps their attention. A saccade corresponds to ballistic movements that have a duration between 30 and 80 ms, made between one fixation and another. This is not necessarily done in a straight line.
The development of this work is based on the segmentation of stimulus into AOIs, that is, regions where the researcher is looking for user behavior. The space between the AOI's is called whitespace, which is not necessarily free of stimulus elements.
For the AOI definition, there is an approach where the stimulus is divided into equal size regions, called segmentation in grid. On the other hand, AOI's can be defined in a semantic way, such that common meanings are used between zones assigned to the same AOI. There may be differences in shape and/or size with respect to other AOI's. We use the latter approach in this work given that it is expected to obtain information about the visual attention of the user in certain areas with a dynamic nature.
Therefore, an important factor in the problem is that for each AOI definition, it is necessary to consider grouping the zones of the stimulus that present the same information for the user. For example, grouping two news items within a newspaper allows the AOI to be characterized as news information. However, grouping an advertisement with one news item does not allow the AOI to be given a characteristic meaning.
This paper considers the visual attention of users to the different AOI's defined within a dynamic stimulus. For this, the concept of visit intention is defined, by which it is evaluated in advance such that the AOI visual stops by a user will be verified. In this way, given the visual stimulus, the AOI's {A j | j = 1...n} and a period of time v (t) = (T f ) are given, where the values of the tuple correspond to the time limits of the period, which are indexed by {t = 1, 2, ...}, and the visit intention Iv(A j , v (t) ) ∈ {0, 1} is defined as a binary variable that indicates whether or not there is a user decision to visit A j in that period. The decision to visit an AOI will be measured as the existence of visual fixations of the user within that stimulus zone. The visit intention for n AOI's in a period time is a vector It is assumed that the vector IV(v (t) ) is characterized at the beginning of the time period v (t) , although in reality, the process of choosing it can take place during all the time before the beginning of the window v (t) . What we are looking for in the following sections is the prediction of this vector for different time windows v (t) . Time window and period of time are used interchangeably throughout the paper, given the information of the user's behavior in the past time (periods v (s) , s = 1, . . . , t − 1).
What follows from this section describes the experiment developed to study the bottom-up visual attention of different users on a dynamic visual stimulus based on the estimation of the values of visit intention in different time periods.

Participants
The experiment is conducted with 51 participants based on a convenience sample. Of these participants, 34 are men and 17 are women, between the ages of 19-35. A large percentage of the participants (88%) are students, and the occupations of the rest include a research assistant, a participant with a bachelor's degree in Hispanic literature, and four engineers.
The participants signed an informed consent form stating that they permit the use of the data collected and meet the following requirements: • They are healthy individuals who do not present with diseases that could harm the results of the experiment, they are not under pharmacological treatment and they do not have a history of neurological or psychiatric diseases. • They do not present with the use of medications or drugs within the past 24 h. • They present with good vision or corrected vision.

Instrumentation
The eye-tracker Tobii T120 (Tobii Technology AB, Stockholm, Sweden) is used, which allows tracking of the subject's gaze during the test. This setup corresponds to a screen of 17 inches, with a resolution of 1280 × 1024 px, where two infrared diodes generate patterns of reflection on the corneas of the user's eyes; the three-dimensional position of each eye is calculated, and therefore, his or her gaze on the screen is determined. At the same time, the pupil sizes of both eyes are recorded over time. The Tobii Studio software allows the adjustment of the user's position during calibration.
The system has a sampling rate of 120 Hz, a typical accuracy of 0.5 Hz, a typical drift of 0.1 degrees, a typical spatial resolution of 0.3 degrees and a typical head movement error of 0.2 degrees.

Web Page
To carry out the study, a dynamic stimulus corresponding to an informative web page is used. This corresponds to new content for the user, as it is content written and edited by the developer, organized in a one page design. To achieve the dynamic character, the static and complete website version cannot be seen entirely on the screen, requiring the use of a scrolling bar by the user. At the same time, dynamic elements within the page, such as drop-down menus or moving elements, are considered.
The web page has different components, that is, the sections of the page that represent an element in its own functions and actions. Web page components can be seen in Figure 1 with different colors, and correspond to the following: "Logo"; "Bar_Menu"; "Title"; "Img" and "New" for each of the seven news items; "Banner", for each of the four advertisements; "Num_Pag" (corresponding to the pages number button); and "Sub_Butt" (corresponding to the user's subscription button). In addition, there are dynamic components, that is, they appear according to the user's interaction with the website. These are the country selection bar, the menu bar presented in Figure 2 and the user registration pop-up window in Figure 3.

Task Design
The objective of the experiment is to predict the attention zones of a user's gaze during the exposure of a dynamic stimulus in a bottom-up attention task. To perform this, a fictitious website is chosen, created specifically for the experiment, which is run locally to avoid possible delays in connection to the internet. The web page is detailed in Section 3.3.
A total of three websites are presented to each participant. Each considers the same structure in the elements of the page, but the content varies. Advertisements and news headers appear randomly and without repetition, out of a total of 12 available advertisements and 21 news items. This approach avoids the experience effect of a user with the stimulus and, at the same time, creates a dynamic stimulus with different content for each user. The aim is that users navigate freely on the website and that this, together with the lack of previous experience of the user with the page, allows the study of bottom-up attention.
The content that is chosen, both for news and announcements, is related to student life. Topics include travel, courses, beer, education, music, technology and miscellaneous.
For the best analysis of the data, the control of variables such as luminosity and the user's movement is required, such that an isolated space is necessary to carry out the study. For eye tracking, direct sunlight, which affects the quality of the measurements, should be avoided. In addition, there should be no light from a top source. To avoid these effects of light, the laboratory is isolated from external light with black-out curtains. In addition, measures are taken to avoid all kinds of external interruptions during the experiment.

Experimental Procedure
During the experiment, only the person in charge of the test and the participant are present in the experimental room. The latter is provided with an explanation of the experiment and asked to read and sign the informed consent form.
The person sits in front of the eye-tracker screen in a manner similar to navigating on a desktop computer, allowing the use of a mouse. The participant is asked to maintain a fixed posture, without moving their head, to avoid data loss.
Each person is prompted to freely browse the web page for as long as they want and to indicate when to finish.
Prior to the tests, each user is subjected to a relaxation stage consisting of the visualization of three videos of four minutes each, consisting of picturesque landscapes with background instrumental music. The second part of this stage asks the participant to take a deep breath for one minute with closed eyes and soft instrumental music in the background. This stage aims to eliminate the Hawthorne effect (behavior modification as a consequence of knowing they are being studied) and physiological effects. In addition, this stage allows the baseline to be acquired for later treatment of the signals. Then, the page is presented randomly to each participant.

Descriptive Analysis
A total of 153 signals are extracted from 51 volunteers for the signals of the gaze behavior and event registration (click, display of dynamic elements, scroll, etc.). Fixations with duration less than 100 ms are filtered from the model. Table 1 shows descriptive statistics of the acquired signals. Table 1. Descriptive statistics of the acquired signals.

Number of records 153
Recording time (s) Number of saccades Fixing duration per record (s) Figure 4 shows the methodology for the construction of predictive models of visit intention in limited periods of time. First, components of the web page are established, and then, AOI's are defined. Afterwards, signal records are divided into time windows of a specific length by solving a linear optimization problem. The time windows are then optimally assigned to training and test sets for cross-validation. Labels for each time window are created, and a oversampling balance method is used for each train-test set to avoid overtraining. For each time window, it is possible to create a set of features used for classification. Different methods are tested, and finally, the results are evaluated.

AOI Definition Criteria
Given the approach of the problem, each defined AOI must have a characteristic meaning; therefore, components of similar semantic meaning were grouped into larger areas. At the same time, the final size of each defined area is considered so that it can be fully visible on the screen by scrolling the page in the corresponding dimensions.
In this way, the web page is separated into five semantic areas shown in Figure 5. A sixth AOI is defined, corresponding to the registration menu that is displayed when a user clicks on the button intended for this functionality, located in the upper navigation bar (see Figure 3). AOI 1 is dynamic, as the menu bar moves together with the screen movements given by the scroll. The AOI's 2 and 3 are groupings of outstanding news, and the division into three and four news groups is chosen since the resolution of the screen allows these AOI's to be completely displayed when performing a lateral scroll. Following the same logic, AOI's 4 and 5 are generated corresponding to advertising banners within the web page. Firstly, the areas of interest are defined as sets of components within the website. Subsequently, the signals captured experimentally are segmented into constant length time windows, which are assigned to each of the sets of the k-fold cross validation. As the target variable for each window, a binary label is generated that is one if the area of interest is visited in that temporal window or zero if not, and subsequently MLSMOTE techniques are used for the balance between the different categories. Later, for each time window, a set of features is associated with which the multilabel classification models with k-fold cross validation of MLKNN, Ridge Regression, K-nearest neighborhood (KNN) and support vector machine (SVM) are trained, where the latter three models are trained under the binary relevance and classification chain approaches. Finally, the metrics shown in the "Result evaluation" box are calculated to choose the best model. AOI 6 has a dynamic nature and is activated by user interaction within the site. When activated, the ocular fixations for the rest of the AOIs located below are deactivated.
Some components of the page are not included in any AOI, as is the case for the subscription bar, the number of pages and the title of news; these are considered to be whitespace.
Thus, each defined AOI meets the conditions of the problem, where AOI 1 acquires the meaning of the navigation menu, areas 2 and 3 refer to news, areas 4 and 5 to advertising and AOI 6 to the user registration menu. Figure 5. AOIs for the web page. AOI1 comprises the page logo and the navigation bar at the top of the web page. AOI2, located at the center of the web page, is composed of the titles of three news articles and their respective images (Img 1, Img 2, and Img 3). AOI3 is composed of the titles of four news articles and their images (Img 1, Img 2, Img 3, and Img 4). AOIs 4 and 5 consists of two advertisements each one, located at the right-hand side of the web page. AOI6 corresponds to the user registration pop-up, which is a dynamic component.

Time Window Segmentation and Cross-Validation
The models developed with population-level features simulate the continuous acquisition of gaze data over time. To evaluate this acquisition, the gaze signal is segmented into time intervals called windows. These windows have a long τ, given as a parameter. Thus, a long signal T will have t = 1, . . . , T/τ time windows v (t) .
To achieve external validity of the prediction models, it is necessary to generate training and test datasets that maintain independence from each other. To achieve this generalization, it is considered that both sets cannot share information from the same user. Thus, using the k-fold cross-validation method, a partition per user is performed. In turn, to select which users belong to each of the k-folds, an optimal allocation is performed that balances the number of windows in each fold. Since each user's total browsing time is different, the number of users in each partition varies.
The assignment is made with a mixed integer linear programming model given by Equations (1a) to (1f). Sets U = 1, . . . , N and KF = 1, . . . , K represent the number of users and fold to be used in cross validation, respectively. Variables {y u,k |u ∈ U; k ∈ KF} are defined with the value one if the user u is assigned to fold k; otherwise, they are zero. The variables {x k |k ∈ KF} represent the sum of windows in fold k, given the assignment of users. Thus, each user is assigned to a fold, minimizing the difference between the total sum of windows in each fold k and the average number of windows in all folds. In the best case, all folds will have a number of windows equal to the averagex.
The restriction (1b) indicates that each user can be assigned to only one of the kfolds. The restriction (1c) defines the variables x i , each representing the weighted sum between the assigned users and the total number of windows it has in the interactions (P u ). The restriction (1d) defines the arithmetic mean of the number of windows in the folds.

Sample Balance
In multilabel classification problems, it is common that label frequencies are not the same, with some that are more frequently active than others. In this case, since labels are defined by combinations of AOIs, vectors will exist with few combinations, as in the cases in which the dynamic areas of interest are activated, such as the pop-up windows, which are carried out to a lesser extent than the activation of the static AOIs.
The MLSMOTE [38] resampling method is used to solve the sample imbalance problem. This method uses the minority class labels as seeds for the generation of new instances. For each minority class label, it searches for the nearest neighbors by obtaining the characteristics of the new samples using interpolation techniques. Therefore, the new instances are synthetic rather than clones of other data.

Features Extraction for Multilabel Classification
For multilabel classification, a set of independent, population-level features is generated, which are processed from gaze data signals and navigation history. Indexes j = 1, . . . , n correspond to the defined AOI's and c = 1, . . . , N to the components present in the web page.
The features related to the ocular fixations carried out in each component and AOI are extracted, which are defined in Table 2. These features attempt to capture attention in certain areas given the history of previously visited areas. The "time since last visit" features for each AOI are calculated as the difference between the start time of the current window and the time of the last time that user make a fixation in the j-th AOI ({AOI} j = 1). Value 0 in case of no previous visit.

End_r1
Corresponds to an AOI indicator in which the previous window ends. It is calculated as: The feature "Heat AOI" adds information about the behavior of other users to the behavior of a particular participant. It is expected that for a stimulus, the model will learn to recognize the visual attention patterns of users over time, at the population level. In this way, the weighting of this feature in the classification models allows learning how to use these patterns in predictions.
Subsequently, visual kinematics features are defined, shown in Table 3. Position, velocity and acceleration measurements are calculated in the X-and Y-coordinates of the gaze position over the stimulus. These features seek to incorporate information in relation to the shape of the ocular movement carried out in a period of time before the beginning of the window, which would allow knowing the future movement patterns. The definitions for the Y coordinate features (coordY, MeanY, StdY, VelY, MeanVelY, StdVelY, AclY, MeanAclY and StdAclY) are similar to those of the X coordinate case.
Finally, features related to the eye-tracking function, both fixation and saccade, are added (see Table 4). These features capture the areas to be visited according to the ocular movements registered for the user.

Feature Definition
coordX It is considered the last position of the previous time window, before starting the time window, registered for the user.

MeanX, StdX
Mean and standard deviation of the gaze position in the X-coordinate recorded in the user interaction with web page in the previous time window.

VelX
Derivative from position data at X-coordinate. This feature takes the velocity value corresponding to the last register of the previous time window.

MeanVelX
Mean and standard deviation of the speed in X, of the data recorded in the previous time window. StdVelX

AclX
Derivative from speed data at X-coordinate. This feature takes the value corresponding to the last acceleration register during navigation.

MeanAclX
Mean and standard deviation of the acceleration data in the X-coordinate recorded in the previous time window. StdAclX Table 4. Features in Eye-Tracking Function.

Feature Definition
NFix Number of fixations made by the user during their interaction with web page along time.

TPromFix
Average time, maximum time and minimum duration of the fixations made by the user during navigation.

TMaxFix TMinFix NSac
Total number of saccades made by the user up to the time before the start of the current time window.

APromSac
Average, maximum and minimum amplitude of the saccades taken by the user up to the moment before the beginning of the current window. AMaxSac AMinSac

Performance Metrics
Models are evaluated by k-fold cross-validation, so the evaluation metrics are independently averaged between the k training and testing sets. To measure the quality of the solution and choose the best configuration, the following metrics are used. Y i corresponds to the real label vector of the visit intention components, and Z i corresponds to the prediction visit intention vector. Only in this case does n represent the number of instances to evaluate.
The problem is originally seen as the search for a user's attention zones over time. This approach allows leeway such that the model does not have to achieve the full prediction of the area served by the user. For this reason, the recall metric is not used directly for the evaluation of a model, which establishes the percentage of activation in the labels that are predicted, but it is used harmoniously in the F-measure metric. On the other hand, it is expected that in cases where the model establishes tag activation, these will actually be areas served by users, minimizing the error to the maximum degree. For this reason, the precision metric, which measures the percentage of activation occurrences that actually happen within the total activation proposed by the model, is used directly in the model selection.
Area under the curve (AUC): A receiver operating characteristic (ROC) curve demonstrates a classifier's ability to discriminate between positive and negative values by changing a threshold value. When the area below the curve is one, it represents a perfect prediction. In the multilabel classification case, it can be calculated from the macro approach, using the rankings of the instances for each label. Ranking r(x i , l) is defined as the function for which the instance x i and the label y l returns the degree of confidence of l in the prediction Z i . In this way, the calculation is given by: Accuracy (Acc): This is the ratio between the number of correctly predicted labels and the actually active labels, given individually by components. It is averaged across all instances: : This is the instances percentage, where all coordinates of each vector that are correctly labeled are compared to the total number of labels, that is, the percentage of instances correctly predicted. In the below formula, the operator corresponds to the bracket of Iverson, which is 1 if the logical proposition is true, and 0 if not.
The ratio between the number of correctly classified labels and the total number of labels. Intuitively, it corresponds to the percentage of labels predicted as true and that are really important for the instance.
It is the average between precision and recall calculated in a harmonic way. It measures how many relevant tags are predicted and how many of the predicted labels are relevant.
F-measure = 2 · Precision · Recall Precision + Recall where Recall is the percentage of correctly predicted labels among all labels:

Methodological Limitations
One of the foremost limitations of this study is the utilization of a stimulus in the form of a free visit to a web page on a desktop computer, which cannot be extrapolated to mobile devices. This is a critical consideration, as the use of mobile devices is on the rise and can significantly impact the browsing experience on a web page, due to variations in screen size, modes of manipulation, and the user's field of view. Additionally, while the free visit task may provide a general understanding of page navigation, it is possible that the results would differ if a different task were employed.
While efforts were made to include a diverse range of participants, it is important to acknowledge that the results may be subject to variation if the sample were more diverse in terms of demographic factors such as age or professional status. Additionally, it is important to note that the results of this study may not be generalizable to other websites that possess distinct characteristics, where interactions may present alternative types of dynamic components. Furthermore, the areas of interest defined in this study are specific to this particular website and may not be applicable to other websites with different components or distribution. However, the methodology proposed in this study can be replicated in other websites. Therefore, it is essential to continue research in this area in the future.

Experimental Results
Classification models are executed with different configurations, making a crossvalidation of 10 folds. The average results measured in the test sets are given in Section 5.1. Test set samples are not balanced, since synthetic data classification is irrelevant. The results therefore correspond to the original distributions of the labels.

Main Results
The proposed configurations are executed using the programming given in [43] ("MLC Toolbox"). The results are presented in Table 5, where the best models are summarized (choosing the number of features and parameters) as detailed below:    A total of ten features are listed. It is observed that all sets require the use of visual kinetics features of gaze position at the beginning of the time window. In addition, it is observed that the average Y-coordinate recorded in the past is sometimes required. Subsequently, it is required to know the activation of any of the AOIs in the previous time window, or the information about the AOI in which the previous window ends. It is interesting to consider that the model only requires knowing the activation of AOIs in a window of time in the past and not the full activity of user interaction. In the sets in which component information is required, the use of the features that indicate the activation of the dynamic components of the website in the previous time window is observed, that is, the menu bar and/or the pop-up registration.
Upon studying the behavior in each of the AOI's (right part of Table 5), it is observed that the model chosen presents the best accuracy values for three AOIs and approaches the best results of the other models for the other AOI's.
The results obtained are validated by comparing them with the biases towards the majority class, which is the proportion of labels with more repetitions with respect to the total. In the formula, m is sample number, MC is majority class and the function inside the argmax calculation is the majority class ratio.
For each AOI, the MC ratio values are 0.653, 0.567, 0.513, 0.629, 0.727, and 0.954, for AOI's 1, . . . , 6, where the majority class corresponds to the activation of each AOI (1 label), except in the case of AOIs 2 and 3 (0 label). At the same time, the bias to the majority class when considering the complete instance is 0.133 (considering the combination of labels of each AOI of a vector as a label), which corresponds to the activation of AOIs 2 and 3. Therefore, at the individual level and as a whole, the model improves the trivial case of always choosing the majority class.

Sensitivity Analysis by Time Window Lengths
Sensitivity analysis is carried out, given variations in the time windows' duration to be considered in classification. With this analysis, it is expected that the robustness of the methodology can be evaluated and used for the prediction of visit intentions.
First, models are run with all the combinations again, considering the duration of time windows with τ = 3, 10, 15, 20 s. Table 6 shows the best settings for each prediction model. Table 7 shows the metrics obtained by the best models for each combination.  Table 7. Sensitivity analysis metrics on the duration of time windows for the best models.

Time Window Duration (Seconds)
Model and Metrics τ = 3 s τ = 5 s τ = 10 s τ = 15 s τ = 20 s  It can be seen that the settings chosen for each model are not the same for all the values of the temporary duration of the window (τ). The best results are obtained with the use of an RR-BR model for longer time windows. It is interesting to see that the case of τ = 3 obtains a better Exact metric.

RR-BR
First, the values of the subset accuracy (Exact) and accuracy (Acc) metrics are inspected for different values of τ. Figures 7 and 8 shows the graphs of these values for each classification model configuration. From these metrics, it can be seen that if a model with the best estimates of the activation of each AOI is desired, then the case of a longer duration in the time windows (τ = 20) is preferred, since in this case the accuracy (which measures performance by the individual label of each AOI) increases in all models. However, when studying the subset accuracy, that is, the correctness in the estimation of the activation of labels in a vector way (all the AOI's within a time window), the duration of the most extreme time windows is preferred, that is, this metric increases with time windows of shorter duration (τ = 3) or longer duration (τ = 20). To analyze the above, it can be considered that if all models achieve the same level of training (which is not necessarily true), the vectors of the visit intention in smaller and larger time windows show less variability. As the size of the window increases to infinity, the variability in the visit intention tends to zero since it is expected to take the value 1 in all AOI's (although this may not be true in cases where AOI 6 is not visited). In the case of a very small window (tending to zero), the intention to visit all AOI's should tend to zero as there is no time to make transitions to other AOI's.
Subsequently, to better understand the results, the precision is studied. In Figure 9, it is observed that the values for precision appear to be stable according to the lengths of the windows, with the exception of the classifiers based on ridge regression, which tend to decrease slightly more when choosing τ = 15. In all cases, precision values greater than 0.8 are achieved, a value commonly considered to be good in the literature. This result means that in the label activation predicted by the model, more than 80% of the cases are performed correctly. In the case of τ = 20, it achieves better accuracy, which may be related to the fact that there are more active labels (as in the case of windows tending to infinity discussed in the previous paragraph), which causes the existence of a greater number of true positive values.
In the case of the F-measure values (see Figure 10), a high trend is observed as the duration of the time windows increases. This result indicates that for longer time windows, greater harmony is achieved between precision and recall. Finally, when evaluating the area under the curve (AUC) graph in Figure 11, it can be observed that it decays for values with longer time windows. This observation indicates that the number of times that the ranking value (confidence given to a certain label) of the positive labels (corresponding to the activation of an AOI) exceeds the ranking for a negative label (that is, no activation of an AOI) is lower. Therefore, the power of discrimination between values predicted as positive (true positives) and negative values that are classified as positive (false positive) is lower in longer time windows. The drop in AUC values shows that longer time window models have worse discrimination between classes. However, in all cases, values above 0.8 are obtained, which is considered good in the literature.
Given the above, it is concluded that at longer time windows, the accuracy and subset accuracy will tend to improve, which does not necessarily reflect a good model performance. It is necessary to consider the discriminative power between classes of the model, for example, from metrics such as AUC. On the other hand, it is important to consider the exploratory analysis of the labels (individual and vectorial) in each case (tau chosen), since it is necessary to make a trade-off between the variability of the labels and the time used to make the predictions.
From previous results, it is considered that the estimation of shorter windows is better, since it gives more detailed information of the visual attention in a better discretized time. However, to ensure that the results of the model are valuable, the use of τ = 3 must be evaluated both with respect to the time required by the model for its execution (which does not exceed the duration of the corresponding window) and the variability of the label sufficiency. By the previous criteria, the model from the beginning was considered with τ = 5 as the main model.

Time Overhead Calculation
The window processing time for the multilabel classification model chosen is given by the signal processing used in the prediction (cleaning of the signals and characteristics calculation), in addition to the execution of the classification.
It is interesting to study the classification times given the duration of the time windows, since on the same equipment, the methodology of the algorithm indicates the speed of computation. The processing times of characteristics will depend on the amount chosen by the model in each of the classifiers considered, according to the duration of the windows, in addition to the processing equipment.
To compare classifiers' performance, all combinations are run to classify windows of different time duration. In particular, windows of length τ = {3, 10, 15, 20} s are considered, in addition to the case of time windows of 5 s. Table 8 shows the mean and standard deviation of execution times of the classifiers in seconds. Execution times for ridge regression models are less than one millisecond in the test sets. This outcome is because the model is based on the storage of the weighting parameters of each variable and only requires a multiplication with the features of the test sets to obtain the result.
Most expensive computational models correspond to those using the KNN base classifier. This is because for a new window in the test set, its classification requires searching for the neighborhood in the training set according to the size given by the parameters of the model.
In Figure 12, it is observed that the execution time decreases with increasing time window length; this effect occurs because with longer times, test sets are generated with a smaller number of temporary window samples.
Finally, it is observed that the classification times are less than the duration of each time window present in the test sets, so they are applicable in an online environment. This approach considers many windows, where the averages in the test sets are 398, 239, 120, 80, and 60 for duration τ of 3, 5, 10, 15 and 20 s, respectively.

Discussion
This paper proposes a method to predict users' visual attention on specific regions of a website called visit intention. This method has been developed in a controlled website where the web objects and their content are known. However, its layout varies, under certain limits, according to the user's interaction. The proposed method for predicting the visit intention does not require knowledge of the website's structure at all periods. However, the underlying idea is that the model learns to recognize the structure implicitly according to the data of the user's behavior and a compilation of visual attention in previous periods.
The trained multilabel classification models allow for predicting visit intentions by jointly incorporating the population-level features of all the AOIs present on the website within each period. In this way, these models comprise a user's visual attention as a whole in all AOIs of the website. The prediction is mostly made by using knowledge of the user's past behavior. In particular, the models use position kinematic features, which consist of the location of the user's gaze at the end of the previous period. With this, the model is encouraged to learn the closest AOIs for the user in a new period based on the current position. The results deliver a total of 10 characteristics in each of the training-test partitions (see Figure 6). In this model, no eye-tracker feature function variables are used, and of the rest, 73% correspond to eye-fixation characteristics and 27% to kinematic characteristics. Indeed, other multilabel classification models can be evaluated, as can neural network models based on deep learning, which is part of our future work.
One limitation of our visit intention approach is that, because of its experimental control, it is not suitable for use in predictions of visual attention in egocentric viewing environments, such as those analyzed in Section 2.3. The reason is that in these environments the stimulus, such as the frames of a video, tends to be less controlled and can suffer considerable variability.
A second limitation is that the study is carried out for the case of bottom-up attention, because users browse a previously unknown website. Therefore, the challenge of predicting visit intention in top-down cases remains open. The reason is because in these cases users tend to seek out certain information in specific regions of the website, depending on clearly defined objectives. At the same time, their website browsing behavior depends on their knowledge about the layout that each user possesses (learned from interactions with other similar websites).
On the other hand, the sample of participants included in the experimentation corresponds, for the most part, to a convenience sample of engineering students, who one would expect to demonstrate greater fluency in website browsing. The study of results for other types of users would allow a better generalization of the models and, at the same time, could provide useful information regarding their improvement, for example, when considering clusters of similar users in the estimation of parameters. Moreover, this sample corresponds to people from the same location with shared cultural characteristics and native speech corresponding to that of the site studied.
Bearing in mind the above limitations, and although a specific website has been used, a proof-of-concept for the proposed method was achieved. Nevertheless, it is necessary to evaluate these models for other stimuli, which would ensure their generalization, depending on the quality of the results obtained.
Regarding the scalability of the solution, it has been demonstrated that the execution of the classification models takes less time than the duration of the periods that are being predicted. The transmission of data and the calculation of features from the eye-tracker data are factors that must be considered in the selection of the period lengths; however, in this paper, these factors have not been studied in greater depth. Depending on the results of such a scalability study, the use of these approaches in the prediction would allow considering other types of platforms, such as applications in mobile web environments and tablets. This requires experimentation with new visual stimuli in a controlled manner and under these new environments.

Conclusions and Future Work
A predictive model is proposed for the concept of visit intention, providing information on the areas of interest of a website that will be visited by a user over a given time period. A proof-of-concept of the proposed method has been achieved by experimenting with a specific informative website that involves dynamic components. The results demonstrate the feasibility of creating these predictive models and test the hypothesis stating that the considered features are useful predictors of visual attention.
The proposed methodology bears an advantage in that it is not required to know the structure of the website in each period, as in other studies reported in the literature; rather, learning the dynamic changes through the gaze data registered for a user, estimating models through population behavior as characteristics and employing a user's visual kinetics to train individual models for each instance of prediction are utilized. The model is applicable from an analysis of execution times, and criteria have been provided in the selection of the window duration used in the segmentation of the total time of user interaction, preferring the use of windows of τ = 5 s for this case.
As future work, it is expected that these results can be integrated into more advanced models of visual attention, in addition to the study of improvements based on models for segmentation of users and the use of deep learning methods. On the other hand, it is necessary to explore in greater detail the impact of the variables used in the models, to find explainable patterns of the relationship with visit intentions. Furthermore, it is interesting to consider the study of the methodology of this style in other visual environments (e.g., mobile devices) and in cases related to specific tasks (e.g., searching for a news item on the site).

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available due to their containing information that could compromise the privacy of research participants.